Optimal ranking in networks with community structure

The World-Wide Web (WWW) is characterized by a strong community structure in which groups of webpages (e.g. those devoted to a common topic or belonging to the same organization) are densely interconnected by hyperlinks. We study how such network architecture affects the average Google rank of individual communities. Using a mean-field approximation, we quantify how the average Google rank of a community's webpages depends on the degree to which the community is isolated from the rest of the world in both incoming and outgoing directions, and on $\alpha$, the only intrinsic parameter of Google's PageRank algorithm. Based on this expression we introduce the concept of a web-community being decoupled from or, conversely, coupled to the rest of the network. We proceed with an empirical study of several internal web-communities within two US universities. The predictions of our mean-field treatment were qualitatively verified in those real-life networks. Furthermore, the value $\alpha = 0.15$ used by Google seems to be optimized for the degree of isolation of communities as they exist in the actual WWW.



PACS numbers: 89.20.Hh, 05.40.Fb, 89.75.Fb



The World Wide Web (WWW), a very large ($\sim 10^{10}$ nodes) network consisting of webpages connected by hyperlinks, presents a challenge for efficient information retrieval and ranking. Apart from the contents of webpages, the network topology around them could be a rich source of information about their relative importance and relevance to the search query. It is the effective utilization of this topological information that advanced the Google search engine to its present position as the most popular tool on the WWW and a profitable company with a current market capitalization of around 80 billion US dollars. As webpages can be grouped based on their textual contents, the language in which they are written, the organizations to which they belong, etc., it should come as no surprise that the WWW has a strong community structure [2] in which similar pages are more likely to contain hyperlinks to each other than to the outside world. Formally, a web community can be defined as a collection of webpages characterized by an above-average density of links connecting them to each other.

In this letter, we address the following question: how does the relative isolation of a community's webpages from the rest of the network affect their Google rank? In addition, we speculate that the parameters of Google's PageRank algorithm were selected for optimal performance given the extent of the community structure in the present WWW network.

At the heart of the Google search engine lies the PageRank algorithm, which determines the global "importance" of every web page based on the hyperlink structure of the WWW network around it. When one enters a search keyword such as "statistical physics" on the Google website, the search engine first localizes the subset of webpages containing this keyword and then simply presents them in descending order of their PageRank values. While the details of the PageRank algorithm have undoubtedly changed since its introduction in 1997, the central "random surfer" idea of its original description has remained essentially the same. From a statistical physics standpoint, PageRank simulates an auxiliary diffusion process taking place on the network in question. A large number of random walkers are initially randomly distributed on the network and are allowed to move along its directed links. Similar diffusion algorithms have recently been applied to study citation and metabolic networks, and the modularity of the Internet on the hardware level, represented by an undirected network of interconnections between Autonomous Systems. As in real web surfing, a random walker of the PageRank algorithm could "get bored" with following a long chain of hyperlinks. To model this scenario, the authors introduced a finite probability $\alpha$ for a random walker to jump directly to a randomly selected node in the network without following any hyperlinks. This leaves the probability $1 - \alpha$ for it to randomly select and follow one of the hyperlinks of the current webpage. According to the original description, $\alpha$ was chosen to be 0.15 in the real PageRank algorithm. The algorithm then simulates this diffusion process until it converges to a stationary distribution. The Google rank (PageRank) $G(i)$ of a node $i$ is proportional to the number of random walkers at this node in such a steady state, and is usually normalized by $\langle G(i) \rangle = 1$. In this normalization, the flux of walkers entering a given site due to random jumps from all the other nodes is given by $\sum_{i=1}^{N} \alpha G(i)/N = \alpha$. The continuity equation for this diffusion process reads

$$G(i) = \alpha + (1 - \alpha) \sum_{j \to i} \frac{G(j)}{K_{out}(j)} .$$

Here $K_{out}(j)$ denotes the number of hyperlinks (the out-degree) of the node $j$, and the summation goes over all nodes $j$ that have a hyperlink pointing to the node $i$. In the matrix formalism the PageRank values are given by the components of the principal eigenvector of an asymmetric positive matrix related to the adjacency matrix of the network. Such an eigenvector can be easily found using a simple iterative algorithm. For this to work, all nodes must satisfy $K_{out}(i) > 0$. In practice, this is achieved by iteratively removing pages with zero out-degree from the network.

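
The iterative solution of the continuity equation can be sketched as follows. This is a minimal illustration rather than Google's implementation: the function name `pagerank`, the dictionary representation of the graph, the stopping tolerance and the three-node toy graph are all assumptions made for the example. Starting from $G(i) = 1$ keeps the normalization $\langle G(i) \rangle = 1$ at every step, because each node redistributes all of its walkers.

```python
def pagerank(out_links, alpha=0.15, tol=1e-12, max_iter=1000):
    """Iterate G(i) = alpha + (1 - alpha) * sum_{j -> i} G(j) / K_out(j).

    out_links maps each node to the list of nodes it links to. Every node
    must have at least one out-link and every link target must be a key
    (pages with zero out-degree are assumed to have been removed already).
    """
    nodes = list(out_links)
    for j in nodes:
        if not out_links[j]:
            raise ValueError(f"node {j!r} has zero out-degree")
    G = {i: 1.0 for i in nodes}            # uniform start, so <G> = 1
    for _ in range(max_iter):
        new = {i: alpha for i in nodes}    # contribution of random jumps
        for j in nodes:
            share = (1.0 - alpha) * G[j] / len(out_links[j])
            for i in out_links[j]:
                new[i] += share            # flux along the hyperlink j -> i
        change = max(abs(new[i] - G[i]) for i in nodes)
        G = new
        if change < tol:
            break
    return G


# Hypothetical toy graph, for illustration only.
toy = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(toy)
mean_rank = sum(ranks.values()) / len(ranks)
print(ranks, mean_rank)
```

Each sweep contracts the error by a factor of $1 - \alpha$, which is why a larger $\alpha$ converges faster.
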
Consider a network in which $N_c$ nodes form a community characterized by an above-average density of edges linking these nodes to each other. Let $E_{cw}$ denote the total number of hyperlinks pointing from nodes in the community to the outside world, and $E_{wc}$ the total number of hyperlinks pointing in the opposite direction. As the Google rank is computed in the steady state of the diffusion process, the total current of surfers $J_{cw}$ leaving the community must be precisely balanced by the opposite current $J_{wc}$ of surfers entering the community. Note that both $J_{cw}$ and $J_{wc}$ consist of two contributions: the current via the direct hyperlinks between the community and the outside world, and the current due to random jumps.

Let $G_c = \langle G(i) \rangle_{i \in c}$ denote the average Google rank of webpages inside the community. The average current flowing along a hyperlink pointing away from the community is given by $(1 - \alpha) G_c / \langle K_{out} \rangle_c$, and the total current leaving the community along all those out-going links is $(1 - \alpha) E_{cw} G_c / \langle K_{out} \rangle_c$. The total number of random walkers residing on nodes inside the community is $G_c N_c$, and the probability of a random jump to lead to a node outside the community is $N_w / (N_c + N_w)$, which is close to 1 as $N_c \ll N_w$. The contribution to the outgoing current due to such jumps is given by $\alpha G_c N_c$, and thus the total outgoing current is $J_{cw} = (1 - \alpha) G_c E_{cw} / \langle K_{out} \rangle_c + \alpha G_c N_c$. Similarly, the incoming current $J_{wc}$ is given by $(1 - \alpha) G_w E_{wc} / \langle K_{out} \rangle_w + \alpha G_w N_c$. Equating these two currents one gets

$$\frac{G_c}{G_w} = \frac{(1 - \alpha) E_{wc} / (\langle K_{out} \rangle_w N_c) + \alpha}{(1 - \alpha) E_{cw} / (\langle K_{out} \rangle_c N_c) + \alpha} .$$

One may notice that $\langle K_{out} \rangle_w N_c$ and $\langle K_{out} \rangle_c N_c$ are respectively equal to $E_{wc}^{(r)}$ and $E_{cw}^{(r)}$, the expected numbers of links connecting the community to the outside world in a random network with the same degree sequence as the network in question. By approximating $G_w \approx 1$, we finally arrive at the following equation:

$$G_c = \frac{(1 - \alpha) E_{wc} / E_{wc}^{(r)} + \alpha}{(1 - \alpha) E_{cw} / E_{cw}^{(r)} + \alpha} . \qquad (1)$$

For simplicity of notation, let us refer to the ratios $E_{wc} / E_{wc}^{(r)}$ and $E_{cw} / E_{cw}^{(r)}$ as $R_{wc}$ and $R_{cw}$ respectively.

Roughly speaking, $R_{cw}$ and $R_{wc}$ quantify how isolated a given community is in each of the two directions connecting it to the outside world. In fact, in most communities both ratios $R_{wc}$ and $R_{cw}$ are below 1 because $E_{wc}$ and $E_{cw}$ are typically less than their expected values in a randomized network. One implication of Eq. (1) is that the average Google rank of a community depends on the pattern of its connections with the outside world through the ratios $R_{cw}$ and $R_{wc}$. For example, if $R_{wc}$ is close to 1 (i.e. the number of links pointing to the community is roughly the same as in a random network with the same degree distribution), $G_c$ reaches its maximum value $1/\alpha$ when $R_{cw} \ll \alpha$, which can be interpreted as the community being very isolated in the out-direction. On the contrary, if the number of out-going links from the community to the outside world is roughly the same as in a corresponding randomized network ($R_{cw} \approx 1$), $G_c$ attains its minimum value of $\alpha$ if the community is very isolated in the in-direction ($R_{wc} \ll \alpha$). From Eq. (1) one can easily see that the relative values of the isolation ratios $R_{cw}$, $R_{wc}$ and the parameter $\alpha$ determine the sensitivity of $G_c$ to the community's connections with the outside world. If either $R_{cw}$ or $R_{wc}$ is comparable to $\alpha$, $G_c$ is sensitive to the exact number of links connecting the community to the outside world in this particular direction. Conversely, if both $R_{wc}, R_{cw} \ll \alpha$, the average Google rank of the community is no longer sensitive to its outside connections, and its value is close to 1, which is the overall average value of $G(i)$ for all nodes. In this case, we refer to the community as being "decoupled" from the outside world. Of course, whether a community is decoupled or coupled depends on the value of $\alpha$. A community decoupled at a particular $\alpha$ could become coupled if a smaller $\alpha$ is chosen.
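
The limiting cases of the mean-field formula $G_c = [(1 - \alpha) R_{wc} + \alpha] / [(1 - \alpha) R_{cw} + \alpha]$ can be checked numerically with a few lines of code. The function name `mean_field_rank` and the sample ratios below are illustrative assumptions, not values taken from the data.

```python
def mean_field_rank(r_wc, r_cw, alpha=0.15):
    """Mean-field average Google rank of a community.

    r_wc and r_cw are the isolation ratios for links entering and
    leaving the community; alpha is the random-jump probability.
    """
    return ((1.0 - alpha) * r_wc + alpha) / ((1.0 - alpha) * r_cw + alpha)


alpha = 0.15
# Random-like in-links, no out-links at all: the upper bound 1/alpha.
g_max = mean_field_rank(1.0, 0.0, alpha)
# No in-links, random-like out-links: the lower bound alpha.
g_min = mean_field_rank(0.0, 1.0, alpha)
# Both ratios far below alpha: a "decoupled" community with G_c close to 1.
g_decoupled = mean_field_rank(0.001, 0.002, alpha)
print(g_max, g_min, g_decoupled)
```

In the decoupled case doubling either ratio changes $G_c$ by less than one percent, whereas for ratios comparable to $\alpha$ the same change is clearly visible.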



FIG. 1: The average Google rank $G_c$ of different communities as a function of the parameter $\alpha$. The communities are within real WWW networks of two US universities (see Table I for details). The data points are obtained by running the PageRank algorithm for different values of $\alpha$. Solid lines are two-parameter best fits to the data with Eq. (1).

To empirically investigate the interplay between $G_c$ and $\alpha$ in the real World-Wide Web, we downloaded complete sets of hyperlinks contained in all webpages within two US universities. We then studied intra-university communities based either on common interests (like schools or departments) or common geographic locations (like individual campuses of a large university system). (See Table I for details.) The relation between $G_c$ and $\alpha$ for six such communities is shown in Fig. 1. As expected from our calculations, as $\alpha$ is lowered, $G_c$ starts to deviate significantly from 1 in all these communities. Moreover, the community "UCLA social science" deviates upward while all the others deviate downward. This can be qualitatively explained by Eq. (1) together with the observation that $R_{wc}$ is greater than $R_{cw}$ in this community, while $R_{wc}$ is less than $R_{cw}$ in all the others (see Table II). Furthermore, by looking at the value of $\alpha$ at which $G_c$ starts to deviate significantly from 1, one can see that different communities become coupled to the outside world at different values of $\alpha$. For example, "UCLA Library" and "UCLA Academic Tech. Service" reach the level of $G_c = 0.8$ when $\alpha$ is around 0.2-0.3, while "UCLA Anderson School of Management" and "LIU CWP campus" reach the same level of coupling only for much lower $\alpha \approx 0.01$-$0.05$.

We would like to point out that Eq. (1) is based on a "mean-field" assumption. The average Google ranks and out-degrees of community nodes sending links to the outside world are assumed to be equal to the overall average values inside the community, and the same is assumed for the nodes in the outside world that have links to the community. Of course this is never perfectly true for real web-communities. For example, a community may be linked from the outside world by a highly ranked authority page, and thus receive an in-coming current larger than predicted by our mean-field calculation. Conversely, it may get links only from relatively unimportant pages, in which case our mean-field model overestimates the actual current. There is no universal rule for estimating even the sign of the deviation from the mean-field predictions. Thus it is impossible to calculate general "corrections" to our mean-field formula. Instead, those corrections have to be considered on a case-by-case basis. Allowing the parameters $R_{cw}$ and $R_{wc}$ in Eq. (1) to deviate from their values prescribed by the mean-field theory provides a simple mathematical formalism to quantify those corrections for real communities. We define $R^*_{cw}$ and $R^*_{wc}$ from the two-parameter best fit of the actual $G_c(\alpha)$ dependence in a given community with Eq. (1) (see Table II). One may regard $R^*_{cw}$ and $R^*_{wc}$ as effective parameters which, in addition to simple geometrical properties of the community such as the numbers of links connecting it to the outside world, take into account the Google ranks of the actual pages sending those links. These "renormalized" ratios $R^*_{cw}$ and $R^*_{wc}$ would be more accurate than their "raw" counterparts ($R_{cw}$ and $R_{wc}$) in determining whether a particular web-community is coupled to or decoupled from the outside world at a given value of $\alpha$.
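
One way to obtain such a two-parameter fit is sketched below. It relies on the observation that Eq. (1), $G_c = [(1 - \alpha) R_{wc} + \alpha] / [(1 - \alpha) R_{cw} + \alpha]$, can be rearranged into $(1 - \alpha) R_{wc} - G_c (1 - \alpha) R_{cw} = \alpha (G_c - 1)$, which is linear in the two ratios and can be solved by ordinary least squares. This is an assumption of the example: the text does not state which fitting procedure was actually used, and a direct nonlinear fit weights the data points differently. The function names and the synthetic data are hypothetical.

```python
def mean_field_rank(r_wc, r_cw, alpha):
    """Eq. (1): mean-field average Google rank of a community."""
    return ((1.0 - alpha) * r_wc + alpha) / ((1.0 - alpha) * r_cw + alpha)


def fit_effective_ratios(alphas, ranks):
    """Least-squares estimate of (R*_wc, R*_cw) from measured G_c(alpha).

    Solves x * R_wc + y * R_cw = b with x = 1 - alpha,
    y = -G_c * (1 - alpha) and b = alpha * (G_c - 1),
    using the 2x2 normal equations.
    """
    sxx = sxy = syy = sxb = syb = 0.0
    for a, g in zip(alphas, ranks):
        x = 1.0 - a
        y = -g * (1.0 - a)
        b = a * (g - 1.0)
        sxx += x * x
        sxy += x * y
        syy += y * y
        sxb += x * b
        syb += y * b
    det = sxx * syy - sxy * sxy
    if abs(det) < 1e-15:
        raise ValueError("data do not determine both ratios")
    r_wc = (sxb * syy - sxy * syb) / det
    r_cw = (sxx * syb - sxy * sxb) / det
    return r_wc, r_cw


# Synthetic, noise-free "measurements" generated from Eq. (1) itself.
alphas = [0.01, 0.05, 0.10, 0.15, 0.30, 0.50]
ranks = [mean_field_rank(0.02, 0.07, a) for a in alphas]
r_wc_fit, r_cw_fit = fit_effective_ratios(alphas, ranks)
print(r_wc_fit, r_cw_fit)
```

With real PageRank data the points do not lie exactly on the curve, so the fitted ratios absorb the deviations from the mean-field assumption.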



TABLE I: Basic statistics of the downloaded academic WWW networks. We chose to study hyperlink networks within Long Island University (LIU, 29476 nodes and 160457 edges) and separately within the University of California at Los Angeles (UCLA, 135533 nodes and 636595 edges). Following Google's original recipe, we iteratively removed webpages with zero out-degree. The resulting networks consist of 15471 nodes and 90111 edges for LIU and 31621 nodes and 353370 edges for UCLA. We then studied several large communities defined by the URL of their servers (e.g. .library.ucla.edu for the "UCLA Library" community).

Community                       $N_c$    $E_{cc}$    $E_{cc}^{(r)}$    $E_{wc}$    $E_{cw}$
--------------------------------------------------------------------------------------------
UCLA Library                    2028     23062       1699              755         2141
UCLA School of Management       1340     15983       739               175         169
UCLA Academic Tech. Services    1907     26597       2248              139         3113
UCLA Social Science Division    626      3986        50                258         142
UCLA Humanity Division          864      4846        79                397         445
LIU CWP Campus                  2756     18376       4105              336         1393
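
The iterative removal of pages with zero out-degree mentioned in the caption of Table I can be sketched as follows. Removing one such page may leave another page without out-links, so the procedure repeats until no page of this kind remains. The function name `prune_dangling` and the four-page example are assumptions made for illustration; storing targets in sets also merges duplicate hyperlinks, which a real crawl might want to keep.

```python
def prune_dangling(out_links):
    """Repeatedly remove nodes with zero out-degree.

    out_links maps each node to an iterable of link targets.
    Returns a new dict mapping each surviving node to a set of targets.
    """
    graph = {node: set(targets) for node, targets in out_links.items()}
    # Targets that never appear as keys are pages without out-links.
    for targets in list(graph.values()):
        for target in targets:
            graph.setdefault(target, set())
    while True:
        dangling = {node for node, targets in graph.items() if not targets}
        if not dangling:
            return graph
        # Drop the dangling nodes and every link pointing to them.
        graph = {
            node: targets - dangling
            for node, targets in graph.items()
            if node not in dangling
        }


# Hypothetical example: "d" has no out-links; removing it strands "c".
web = {"a": ["b"], "b": ["a", "c"], "c": ["d"], "d": []}
pruned = prune_dangling(web)
print(pruned)
```

The cascade in this example shows why the pruned networks in Table I are considerably smaller than the raw ones.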



TABLE II: $R_{cw}$, $R_{wc}$, $R^*_{cw}$ and $R^*_{wc}$ for different communities. $R_{cw}$ and $R_{wc}$ are obtained by counting the links from the community to the world and vice versa, divided by the corresponding number of links in a random network with the same degree distribution. $R^*_{cw}$ and $R^*_{wc}$ are the result of fitting the dependence of $G_c$ on $\alpha$ with Eq. (1).

Community                       $R_{wc}$    $R_{cw}$    $R^*_{wc}$    $R^*_{cw}$
---------------------------------------------------------------------------------
UCLA Library                    0.04        0.09        0.02          0.07
UCLA School of Management       0.01        0.01        0.005         0.006
UCLA Academic Tech. Services    0.007       0.1         0.003         0.07
UCLA Social Science Division    0.04        0.03        0.02          0.01
UCLA Humanity Division          0.04        0.08        0.05          0.07
LIU CWP Campus                  0.03        0.09        0.01          0.02



The effective ratios $R^*_{cw}$ and $R^*_{wc}$ for the six communities used in our study are listed in Table II and visualized in Fig. 2. Generally speaking, the closer a community is to the origin in this figure, the lower the value of $\alpha$ at which it first becomes coupled to the outside world. One can see that for $\alpha = 0.15$, which is the actual value used by Google [5], all six of our communities are essentially decoupled from the outside world. However, if a much smaller value of $\alpha$ (say 0.01) is chosen, 5 out of 6 of our communities (all except the "UCLA Anderson School of Management") would become sensitive to their connections with the outside world. In principle, Fig. 2 might be extended to include the region where $R^*_{cw}$ and $R^*_{wc}$ are above one, but by definition those points do not refer to well-defined communities. From Eq. (1) it follows that it is the asymmetry between $R_{cw}$ and $R_{wc}$ which determines whether $G_c$ is greater than or less than 1. Thus the diagonal in Fig. 2 separates communities with $G_c > 1$ from those with $G_c < 1$. The ratio between the x- and y-coordinates of the community in this plot determines the asymptotic value of its Google rank $G_c$ for $\alpha$ close to zero. Thus the two communities "UCLA Academic Tech. Service" and "UCLA Social Science", whose ratios between their x- and y-coordinates in this plot are respectively the smallest and the largest in our set, deviate the most from $G_c = 1$, as shown in Fig. 1.
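
The asymptotic argument can be illustrated with the fitted ratios of Table II: for $\alpha \to 0$ the mean-field formula reduces to $G_c \to R^*_{wc} / R^*_{cw}$. The snippet below is a sketch that assumes the last two columns of Table II are $R^*_{wc}$ and $R^*_{cw}$ in that order; the variable names are illustrative.

```python
# (R*_wc, R*_cw) per community, read from Table II under the stated assumption.
fitted = {
    "UCLA Library": (0.02, 0.07),
    "UCLA School of Management": (0.005, 0.006),
    "UCLA Academic Tech. Services": (0.003, 0.07),
    "UCLA Social Science Division": (0.02, 0.01),
    "UCLA Humanity Division": (0.05, 0.07),
    "LIU CWP Campus": (0.01, 0.02),
}

# Limit of the mean-field formula for alpha -> 0: G_c -> R*_wc / R*_cw.
asymptote = {name: r_wc / r_cw for name, (r_wc, r_cw) in fitted.items()}

lowest = min(asymptote, key=asymptote.get)
highest = max(asymptote, key=asymptote.get)
above_one = [name for name, g in asymptote.items() if g > 1.0]

for name, g in sorted(asymptote.items(), key=lambda item: item[1]):
    print(f"{name:30s} {g:.3f}")
```

Under this reading of the table, the two extremes are the Academic Tech. Services and Social Science communities, and only the latter lies on the $G_c > 1$ side of the diagonal.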



FIG. 2: $R^*_{cw}$ and $R^*_{wc}$ for different communities. Communities inside the lightly shaded square are decoupled from the rest of the world for $\alpha = 0.15$, while the ones inside the dark shaded square are decoupled for $\alpha = 0.01$.

The dominance of Google and the all-important role of its ranking have led to the appearance of services offering "search engine optimization" to their clients. They promise to modify the content and the hyperlink structure of a client's webpages to improve their Google rank. Our findings suggest one obvious way in which such an "optimization" could be achieved: the number of links pointing to the outside world should be reduced to the minimum while the number of intra-community hyperlinks is kept at the maximum. However, as we demonstrated above, the success of such a strategy depends on whether or not the community in question is coupled to the outside world. Indeed, the average Google rank of a decoupled community is virtually insensitive to the exact balance of hyperlinks connecting it to the outside world.

Since the coupling of web-communities to the outside world, and the resulting ability of their webmasters to artificially boost the ranking, is undesirable for a search engine, it should come as no surprise that the internal parameter $\alpha$ chosen by Google's team is carefully selected to minimize this effect. To make most of the communities decoupled, the value of $\alpha$ in the PageRank algorithm should be as large as possible. On the other hand, for very large $\alpha$ the algorithm fails to take into account the relevant network properties of the WWW. Indeed, for $\alpha$ close to 1, random surfers rarely follow hyperlinks and thus nearly all topological information about the network is lost. Therefore, the optimal value of $\alpha$ should be chosen based on the realistic values of the isolation parameters $R_{cw}$ and $R_{wc}$. In our study we found all the communities to be effectively decoupled at $\alpha = 0.15$ but not at smaller values of $\alpha$ (e.g. $\alpha = 0.01$, shown as a dark shaded square in Fig. 2). Thus for our sample of web-communities the value $\alpha = 0.15$ originally proposed for the PageRank algorithm indeed serves Google's goals by striking a good balance between the two opposing demands on the value of $\alpha$.

Work at Brookhaven National Laboratory was carried out under Contract No. DE-AC02-98CH10886, Division of Material Science, U.S. Department of Energy.